†Department of Computer Science and Engineering, The Chinese University of Hong Kong
1155157061@link.cuhk.edu.hk {wxwang,yxwan9,lyu}@cse.cuhk.edu.hk
‡Tencent AI Lab
arXiv:2303.13648v1 [cs.CL] 15 Mar 2023
ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.
ChatGPT, the current “super-star” in the artificial intelligence (AI) area, has attracted millions of registered users within just a week since its launch by OpenAI. One of the reasons for ChatGPT being so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), and code debugging (Xia and Zhang, 2023). There is also a trend of using ChatGPT as a writing assistant for text polishing.
Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014) and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million users daily (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:
Is ChatGPT a good tool for GEC?
To the best of our knowledge, this is the first study on ChatGPT’s ability in GEC.
The major insights gained from this evaluation are as follows:
ChatGPT performs worse than the baseline systems in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences.
ChatGPT goes beyond one-by-one corrections by introducing more changes to the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness.
Human evaluation quantitatively demonstrates that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections.
Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.
| Type | Error | Correction |
|---|---|---|
| Preposition | I sat in the talk | I sat in on the talk |
| Morphology | dreamed | dreamt |
| Determiner | I like the ice cream | I like ice cream |
| Tense/Aspect | I like play basketball | I like playing basketball |
| Syntax | I have not the book | I do not have the book |
| Punctuation | We met they talked and left | We met, they talked and left |
Table 1: Different types of error in GEC.
2 Background
2.1 ChatGPT
ChatGPT is an intelligent chatbot powered by large language models developed by OpenAI. It has attracted great attention from industry, academia, and the general public due to its strong ability in answering various follow-up questions, correcting inappropriate questions (Zhong et al., 2023), and even refusing illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF; Christiano et al., 2017).
Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand because writing plays an important role in academics, work, and daily life. Table 1 illustrates different grammatical errors, borrowed from the comprehensive survey on grammatical error correction by Bryant et al. (2022). In general, grammatical errors can be roughly classified into three categories: omission errors, such as the missing "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as the extra "the" in the third example.
| System | Precision | Recall | F0.5 |
|---|---|---|---|
| GECToR | 71.2 | 38.4 | 60.8 |
| Grammarly | 67.3 | 51.1 | 63.3 |
| ChatGPT | 51.2 | 62.8 | 53.1 |
Table 2: GEC performance of GECToR, Grammarly, and ChatGPT.
To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:
but introduces a new dataset, namely, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities (Bryant et al., 2019).
JFLEG: It represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native-sounding (Tetreault et al., 2017).
Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 task (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of the grammatical errors. We pulled 100 sentences sequentially from the official-combined test set in the alternate folder of the dataset.
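The CoNLL2014 test data is distributed in the M2 format, where a line starting with `S ` holds a tokenized sentence and the following lines starting with `A ` hold its annotated edits. The report does not describe how the 100 sentences were read, so the following is only a minimal sketch of one way to take the first sentences of such a file in order. The helper names `iter_m2_sentences` and `first_n` are hypothetical, and the sample lines are made up for illustration rather than copied from the dataset.

```python
from itertools import islice

# Illustrative M2-style sample (NOT taken from the CoNLL2014 test set).
SAMPLE_M2 = """\
S This are a sentence .
A 1 2|||SVA|||is|||REQUIRED|||-NONE-|||0

S I like play basketball .
A 2 3|||Vform|||playing|||REQUIRED|||-NONE-|||0
A 2 3|||Vform|||to play|||REQUIRED|||-NONE-|||1
"""


def iter_m2_sentences(lines):
    """Yield (tokens, annotations) pairs from M2-formatted lines.

    Each annotation is (start, end, error_type, correction, annotator_id),
    with start/end being token offsets into the source sentence.
    """
    tokens, annotations = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("S "):
            if tokens is not None:
                yield tokens, annotations
            tokens, annotations = line[2:].split(), []
        elif line.startswith("A ") and tokens is not None:
            span, error_type, correction, *_unused, annotator = line[2:].split("|||")
            start, end = map(int, span.split())
            annotations.append((start, end, error_type, correction, int(annotator)))
    if tokens is not None:
        yield tokens, annotations


def first_n(lines, n=100):
    """Take the first n sentences in file order (sequential sampling)."""
    return list(islice(iter_m2_sentences(lines), n))


subset = first_n(SAMPLE_M2.splitlines(), n=100)
for tokens, annotations in subset:
    print(" ".join(tokens), "->", len(annotations), "annotated edit(s)")
```

With a real file, the same function could be fed an open file handle instead of `SAMPLE_M2.splitlines()`; the path to that file is left out here because it depends on the local copy of the dataset.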
Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely, Precision, Recall, and F0.5 score. Among them, the F0.5 score combines both Precision and Recall, where Precision is assigned a higher weight (Wikipedia contributors, 2023a).
| System | Short Precision | Short Recall | Short F0.5 | Medium Precision | Medium Recall | Medium F0.5 | Long Precision | Long Recall | Long F0.5 |
|---|---|---|---|---|---|---|---|---|---|
| GECToR | 76.9 | 38.5 | 64.1 | 68.8 | 37.5 | 58.9 | 71.8 | 38.9 | 61.5 |
| Grammarly | 62.5 | 60.6 | 62.1 | 68.9 | 56.0 | 65.9 | 67.3 | 45.3 | 61.4 |
| ChatGPT | 58.5 | 66.7 | 60.0 | 48.7 | 60.7 | 50.7 | 51.0 | 62.8 | 53.0 |
Table 3: GEC performance with respect to sentence length.
Specifically, the three metrics are expressed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{1}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{2}$$
$$F_{0.5} = \frac{1.25 \times \mathrm{Precision} \times \mathrm{Recall}}{0.25 \times \mathrm{Precision} + \mathrm{Recall}}, \tag{3}$$
where TP, FP, and FN represent the true positives, false positives, and false negatives of the predictions, respectively. We use the official scoring program provided by CoNLL2014, adapted to be compatible with the latest Python environment.
Do grammatical error correction on all the following sentences I type in the conversation.
We query ChatGPT with this prompt for each test sample.
Grammarly: Grammarly is a prevalent cloud-based English typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors (Wikipedia contributors, 2023b). As stated by Grammarly, every day, 30 million people and 50,000 teams around the world use Grammarly with their writing (Grammarly, 2023). When querying Grammarly, we open a text file and paste all the test samples into separate paragraphs. We enable all grammar correction options in the settings and ask it to correct only the issues flagged as correctness problems (red underline), while leaving the clarity (blue underline), engagement (green underline), and delivery (purple underline) suggestions unchanged. We iterate this process several times until there is no error detected by Grammarly.
GECToR: We also compare ChatGPT with GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model in research, which also exhibits good performance on the CoNLL2014 task. We adopt the implementation based on the pre-trained RoBERTa model.
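Equations (1)-(3) can be sketched in a few lines of Python. This is not the official CoNLL2014 scorer, which also handles edit extraction and multiple annotators; it only restates the formulas, assuming the TP, FP, and FN counts are already available. The function names and the choice to return 0.0 on an empty denominator are assumptions made for this sketch. The last part recomputes F0.5 from the Precision and Recall values reported in Table 2 as a sanity check; small differences are expected because the reported Precision and Recall are rounded.

```python
def fbeta_from_pr(precision, recall, beta=0.5):
    """Equation (3) for a general beta; beta=0.5 gives the 1.25 / 0.25 weights."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Assumption of this sketch: define the score as 0.0 when both inputs are 0.
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


def precision_recall_f05(tp, fp, fn):
    """Equations (1)-(3) from raw counts of edit-level TP, FP and FN."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fbeta_from_pr(precision, recall)


# (Precision, Recall, F0.5) as reported in Table 2, in percent.
TABLE_2 = {
    "GECToR": (71.2, 38.4, 60.8),
    "Grammarly": (67.3, 51.1, 63.3),
    "ChatGPT": (51.2, 62.8, 53.1),
}

for system, (p, r, reported) in TABLE_2.items():
    recomputed = fbeta_from_pr(p, r)
    print(f"{system}: reported F0.5 = {reported}, recomputed = {recomputed:.2f}")
```

Because Precision is weighted more heavily than Recall, a system with high Recall but lower Precision (ChatGPT in Table 2) ends up with a lower F0.5 than a system with the opposite profile (GECToR).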
Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall value, GECToR obtains the highest precision value, while Grammarly achieves a better balance between the two metrics and results in the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only those errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both, so it performs more stably.
| System | Output |
|---|---|
| Source | For an example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise . |
| Reference | For example , if exercising (OR exercise) is helpful for a potential family disease , we can always look for more chances for the family to do exercise . |
| GECToR | For example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise . |
| Grammarly | For example , if exercising is helpful for a family ’s potential disease , we can always look for more chances for the family to go exercise . |
| ChatGPT | For example , if exercise is helpful in preventing potential family diseases , we can always look for more opportunities for the family to exercise . |
Table 4: Comparison of the outputs from different GEC systems.
Table 5: GEC performance with Grammarly for further correction.
| System | Precision | Recall | F0.5 |
|---|---|---|---|
| GECToR | 71.2 | 38.4 | 60.8 |
| + Grammarly | -5.9 | +16.5 | +2.1 |
| ChatGPT | 51.2 | 62.8 | 53.1 |
| + Grammarly | +0.4 | +0.8 | +0.5 |
Table 6: Human evaluation counts of under-corrections (#Under), mis-corrections (#Mis), and over-corrections (#Over).
| System | #Under | #Mis | #Over |
|---|---|---|---|
| GECToR | 13 | 4 | 0 |
| Grammarly | 14 | 0 | 1 |
| ChatGPT | 3 | 3 | 30 |
ChatGPT is not limited to correcting errors in a one-by-one fashion. Instead, it is more willing to change the surface expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., “an example” to “example”, “family potential disease” to “a family ’s potential disease”), while ChatGPT modifies the sentence structure (i.e., “for family potential disease” to “in preventing potential family diseases”) and word choice (i.e., “chances” to “opportunities”). This suggests that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.
To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We can observe that Grammarly introduces a negligible improvement to the output of ChatGPT, demonstrating that ChatGPT indeed generates correct sentences. In contrast, Grammarly noticeably improves the performance of GECToR (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.
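As a worked reading of Table 5, the "+ Grammarly" rows are differences relative to the row above them. The sketch below adds those differences to the base scores and checks that the resulting F0.5 agrees with the F0.5 formula up to rounding. The dictionary and function names are hypothetical, and all numbers are copied from Table 5.

```python
# (Precision, Recall, F0.5) before Grammarly, in percent (Table 5).
BASE = {"GECToR": (71.2, 38.4, 60.8), "ChatGPT": (51.2, 62.8, 53.1)}
# Changes after letting Grammarly further correct the output (Table 5).
DELTA = {"GECToR": (-5.9, 16.5, 2.1), "ChatGPT": (0.4, 0.8, 0.5)}


def f_half(precision, recall):
    """F0.5 as defined in the paper: precision weighted higher than recall."""
    return 1.25 * precision * recall / (0.25 * precision + recall)


def after_grammarly(system):
    """Absolute scores after further correction: base plus reported delta."""
    return tuple(round(b + d, 1) for b, d in zip(BASE[system], DELTA[system]))


for system in BASE:
    precision, recall, f05 = after_grammarly(system)
    print(system, "+ Grammarly:", precision, recall, f05,
          "| F0.5 recomputed from P and R:", round(f_half(precision, recall), 1))
```

Under these numbers, GECToR followed by Grammarly lands at roughly 65.3 Precision, 54.9 Recall, and 62.9 F0.5, while ChatGPT followed by Grammarly lands at roughly 51.6, 63.6, and 53.6. The large Recall gain for GECToR and the small change for ChatGPT are the contrast the paragraph above relies on.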
Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) Under-correction, i.e., grammatical errors that are not found; 2) Mis-correction, i.e., grammatical errors that are found but modified incorrectly, so that the result is either grammatically or semantically incorrect; and 3) Over-correction, i.e., other modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. ChatGPT has the fewest under-corrections among the three systems and fewer mis-corrections than GECToR, which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces more over-corrections, which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expressions in GEC.
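The three issue types are annotated manually in this study. Purely as an illustration of the definitions, the sketch below approximates them automatically by comparing token-level edits of a system output against those of a single reference. This is a crude heuristic and not the annotation procedure used here: an edit flagged as "mis" may well be a valid alternative correction, and real annotation also considers meaning. All function names are hypothetical, and the example sentences build on the Tense/Aspect row of Table 1.

```python
from collections import Counter
from difflib import SequenceMatcher


def token_edits(source, corrected):
    """Edits as (start, end, replacement_tokens) over source token positions."""
    src, cor = source.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=cor, autojunk=False)
    return [
        (i1, i2, tuple(cor[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def touches(a, b):
    """True if two edits affect the same place in the source sentence."""
    if a[0] == a[1] or b[0] == b[1]:  # at least one pure insertion
        return a[0] <= b[1] and b[0] <= a[1]
    return a[0] < b[1] and b[0] < a[1]


def tally_issues(source, reference, hypothesis):
    """Heuristic counts of under-, mis- and over-corrections."""
    ref_edits = token_edits(source, reference)
    hyp_edits = token_edits(source, hypothesis)
    counts = Counter()
    for ref in ref_edits:
        if not any(touches(ref, hyp) for hyp in hyp_edits):
            counts["under"] += 1  # reference edit the system never attempted
    for hyp in hyp_edits:
        overlapping = [ref for ref in ref_edits if touches(hyp, ref)]
        if not overlapping:
            counts["over"] += 1  # change with no counterpart in the reference
        elif hyp not in overlapping:
            counts["mis"] += 1  # right place, different replacement
    return counts


SOURCE = "I like play basketball"
REFERENCE = "I like playing basketball"
for output in (SOURCE, "I like played basketball", "I really like playing basketball"):
    print(repr(output), dict(tally_issues(SOURCE, REFERENCE, output)))
```

Under this heuristic, a system that rewrites freely accumulates "over" counts even when every rewrite is grammatical, which mirrors why over-corrections lower F0.5 without necessarily indicating errors.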
different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems but Grammarly is, though the modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by 0.0015. Other than punctuation problems, Grammarly also corrects a few grammatical errors on articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example,
... constructs of the family and kinship are a social construct,
...
Grammarly first changes it to
... constructs of the family and kinship are a social constructs,
...
Then, it changes it to
... constructs of the family and kinship are social constructs,
...
Nonetheless, it does correct some errors that ChatGPT fails to correct.
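The two-pass behavior in this example, together with the procedure of repeating the check until no error is detected, amounts to applying a single-pass corrector until its output stops changing. The sketch below illustrates only that loop. `toy_corrector` is a hypothetical stand-in that hard-codes the two rewrites from the example above; it is not Grammarly and does not call any Grammarly interface.

```python
def apply_until_stable(correct_once, text, max_rounds=10):
    """Re-apply a single-pass corrector until its output stops changing.

    Returns the full history so intermediate states can be inspected.
    """
    history = [text]
    for _ in range(max_rounds):
        revised = correct_once(history[-1])
        if revised == history[-1]:
            break  # nothing left to correct
        history.append(revised)
    return history


def toy_corrector(text):
    """Hypothetical stand-in that applies at most one fix per pass."""
    rules = [
        ("are a social construct,", "are a social constructs,"),
        ("are a social constructs,", "are social constructs,"),
    ]
    for old, new in rules:
        if old in text:
            return text.replace(old, new)
    return text


SENTENCE = "constructs of the family and kinship are a social construct,"
for step, state in enumerate(apply_until_stable(toy_corrector, SENTENCE)):
    print(step, state)
```

The `max_rounds` cap is an assumption added so the loop terminates even if a corrector keeps flipping between two variants.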
This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than a commercial product, Grammarly, and a state-of-the-art model, GECToR, in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.
There are several limitations in this version, which we leave for future work:
More Prompts and In-context Learning: In this version, we only use one prompt to query ChatGPT and do not utilize advanced techniques from the in-context learning field, such as providing demonstration examples (Brown et al., 2020) or chain-of-thought prompting (Wei et al., 2022b), which may under-estimate the full potential of ChatGPT. In our future work, we will explore in-context learning methods for GEC to improve its performance.
More Evaluation Metrics: In this version, we only adopt Precision, Recall, and F0.5 as evaluation metrics. In our future work, we will utilize more metrics, such as pretraining-based metrics (Gong et al., 2022), to evaluate the performance comprehensively.
Grammarly. 2023. Grammarly website about us page.
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. ArXiv.
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.
Sebastian Ruder. 2022. NLP-progress.
Wikipedia contributors. 2023a. F-score. Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].
Wikipedia contributors. 2023b. Grammarly. Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].
Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.